joined_all <- read_csv("../4_cleaned_data/daily_energy_weather_all.csv") %>%
mutate(yearseason = factor(yearseason, levels = c(
"Autumn 2011", "Winter 2011", "Spring 2012", "Summer 2012",
"Autumn 2012", "Winter 2012", "Spring 2013", "Summer 2013",
"Autumn 2013", "Winter 2013"
)))
Rows: 3505100 Columns: 21── Column specification ────────────────────────────────────────
Delimiter: ","
chr (6): lc_lid, month, season, yearseason, wday, temp_qual
dbl (10): kwh, cloud_cover, sunshine, global_radiation, max...
lgl (3): weekend, precipitation_tf, snow_tf
date (2): date, quarter
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(joined_all)
lc_lid date month season
Length:3505100 Min. :2011-12-01 Length:3505100 Length:3505100
Class :character 1st Qu.:2012-10-21 Class :character Class :character
Mode :character Median :2013-03-29 Mode :character Mode :character
Mean :2013-03-27
3rd Qu.:2013-09-10
Max. :2014-02-27
quarter yearseason wday weekend
Min. :2011-12-01 Spring 2013:496241 Length:3505100 Mode :logical
1st Qu.:2012-09-01 Winter 2012:495149 Class :character FALSE:2506363
Median :2013-03-01 Summer 2013:488253 Mode :character TRUE :998737
Mean :2013-02-09 Autumn 2012:475560
3rd Qu.:2013-09-01 Autumn 2013:473333
Max. :2013-12-01 Winter 2013:451590
(Other) :624974
kwh cloud_cover sunshine global_radiation max_temp
Min. : 0.000 Min. :0.000 Min. : 0.000 Min. : 12.0 Min. :-0.20
1st Qu.: 4.697 1st Qu.:3.000 1st Qu.: 0.400 1st Qu.: 36.0 1st Qu.:10.00
Median : 7.828 Median :5.000 Median : 3.100 Median : 81.0 Median :14.10
Mean : 10.144 Mean :4.716 Mean : 3.918 Mean :109.2 Mean :14.99
3rd Qu.: 12.586 3rd Qu.:7.000 3rd Qu.: 6.300 3rd Qu.:172.0 3rd Qu.:19.80
Max. :332.556 Max. :8.000 Max. :14.500 Max. :333.0 Max. :34.10
NA's :863
mean_temp min_temp precipitation_tf precipitation pressure
Min. :-2.60 Min. :-7.600 Mode :logical Min. : 0.000 Min. : 97900
1st Qu.: 6.80 1st Qu.: 3.200 FALSE:1533465 1st Qu.: 0.000 1st Qu.:100690
Median :10.50 Median : 7.400 TRUE :1971635 Median : 0.200 Median :101360
Mean :11.26 Mean : 7.468 Mean : 2.013 Mean :101298
3rd Qu.:15.90 3rd Qu.:11.900 3rd Qu.: 2.400 3rd Qu.:101990
Max. :25.40 Max. :20.700 Max. :29.800 Max. :104120
snow_tf snow_depth temp_qual
Mode :logical Min. :0.00000 Length:3505100
FALSE:3479083 1st Qu.:0.00000 Class :character
TRUE :26017 Median :0.00000 Mode :character
Mean :0.02418
3rd Qu.:0.00000
Max. :7.00000
joined_all %>%
distinct(lc_lid)
All 5,566 households. Remember some have very few data points, some have lots of zero values. <- include these measures in summary df so can eliminate based on rules.
hhold_summary <- joined_all %>%
mutate(zero_kwh = if_else(kwh == 0, T, F),
gt150_kwh = if_else(kwh >= 150, T, F)) %>%
group_by(lc_lid) %>%
summarise(num_values = n(),
num_zeros = sum(zero_kwh),
num_gt150 = sum(gt150_kwh),
min_value = min(kwh),
max_value = max(kwh),
range = (max(kwh) - min(kwh)),
median_kwh = median(kwh),
iqr = IQR(kwh),
mean_kwh = mean(kwh),
sd = sd(kwh)) %>%
mutate(pc_missing_days = (100 * (total_days - num_values) / total_days),
pc_zero_values = (100 * num_zeros / num_values),
pc_gt150 = (100 * num_gt150 / num_values)) %>%
ungroup() %>%
mutate(has_zeros = if_else(pc_zero_values > 0, T, F),
has_150kwh = if_else(pc_gt150 > 0, T, F))
hhold_summary %>%
filter(min_value > 0) %>%
arrange(min_value)
Summary:
# need to make sure yearseason is fctr with right levels...
joined_all %>%
group_by(yearseason, lc_lid) %>%
summarise(count = n()) %>%
ungroup() %>%
group_by(yearseason) %>%
summarise(num_hh = n(),
sum_values = sum(count),
min_hh = min(count),
max_hh = max(count)) %>%
arrange(yearseason)
`summarise()` has grouped output by 'yearseason'. You can override using the `.groups` argument.
Summer 2013 has 5,445 households - at least one with only 1 day counted Winter 2013 has 5,134 households - at least one with only 2 days counted
Do I need to make a subset of these? i.e. keep only households with >=49 (7 weeks’) values in both summer and winter 2013
total_days_summer2013 <- joined_all %>%
filter(yearseason == "Summer 2013") %>%
distinct(date) %>%
nrow()
total_days_winter2013 <- joined_all %>%
filter(yearseason == "Winter 2013") %>%
distinct(date) %>%
nrow()
total_days_summer2013
[1] 92
total_days_winter2013
[1] 89
So max number values is 92 days in summer 2013, 89 days in winter 2013
Cutoff for including a household? What’s the minimum number days we need to understand the summary metrics? What looks reasonable from viz?
joined_all %>%
filter(yearseason %in% c("Summer 2013", "Winter 2013")) %>%
group_by(lc_lid, yearseason) %>%
summarise(num_values = n()) %>%
ggplot() +
geom_histogram(aes(x = num_values, fill = yearseason), bins = 9, show.legend = FALSE) +
facet_wrap( ~ yearseason, ncol = 1)
`summarise()` has grouped output by 'lc_lid'. You can override using the `.groups` argument.
Most households have most of the days recorded
# zoom in on < 80
joined_all %>%
filter(yearseason %in% c("Summer 2013", "Winter 2013")) %>%
group_by(lc_lid, yearseason) %>%
summarise(num_values = n()) %>%
filter(num_values < 80) %>%
ggplot() +
geom_histogram(aes(x = num_values, fill = yearseason), bins = 40, show.legend = FALSE) +
facet_wrap( ~ yearseason, ncol = 1)
`summarise()` has grouped output by 'lc_lid'. You can override using the `.groups` argument.
~75 would pick up the top end of this plot
75/92
[1] 0.8152174
75/89
[1] 0.8426966
Or 85% of the timeframe
0.85*total_days_summer2013
[1] 78.2
0.85*total_days_winter2013
[1] 75.65
Let’s say 77 days (11 weeks) (~85% of each season)
joined_all %>%
filter(lc_lid %in% winter_summer_2013_hholds) %>%
filter(yearseason %in% c("Summer 2013", "Winter 2013")) %>%
group_by(yearseason, lc_lid) %>%
summarise(count = n()) %>%
ungroup() %>%
group_by(yearseason) %>%
summarise(num_hh = n())
`summarise()` has grouped output by 'yearseason'. You can override using the `.groups` argument.
n = 4,999 households in each season
How have hhold stats changed by filtering for Summer / Winter 2013 only?
hhold_stats_sw2013 <- summer_winter2013 %>%
mutate(zero_kwh = if_else(kwh == 0, T, F),
gt150_kwh = if_else(kwh >= 150, T, F)) %>%
group_by(lc_lid, yearseason) %>%
summarise(num_values = n(),
num_zeros = sum(zero_kwh),
num_gt150 = sum(gt150_kwh),
min_value = min(kwh),
max_value = max(kwh),
range = (max(kwh) - min(kwh)),
median_kwh = median(kwh),
iqr = IQR(kwh),
mean_kwh = mean(kwh),
sd = sd(kwh)) %>%
mutate(pc_missing_days = (100 * (total_days - num_values) / total_days),
pc_zero_values = (100 * num_zeros / num_values),
pc_gt150 = (100 * num_gt150 / num_values)) %>%
ungroup() %>%
mutate(has_zeros = if_else(pc_zero_values > 0, T, F),
has_150kwh = if_else(pc_gt150 > 0, T, F))
`summarise()` has grouped output by 'lc_lid'. You can override using the `.groups` argument.
hhold_stats_sw2013
# check for zero values
hhold_stats_sw2013 %>%
filter(has_zeros) %>%
ggplot() +
geom_histogram(aes(x = pc_zero_values, fill = yearseason), bins = 50) +
facet_wrap(~ yearseason, ncol = 1)
Use summer and winter 2013 subset, with the 4,999 households with enough values in these seasons
Reduce to 1 row per hhold with key features
winter_weekend_mean <- summer_winter2013 %>%
filter(yearseason == "Winter 2013" & weekend) %>%
group_by(lc_lid) %>%
mutate(winter_wkend_mean = mean(kwh)) %>%
ungroup() %>%
select(lc_lid, winter_wkend_mean) %>%
unique()
winter_weekday_mean <- summer_winter2013 %>%
filter(yearseason == "Winter 2013" & !weekend) %>%
group_by(lc_lid) %>%
mutate(winter_wkday_mean = mean(kwh)) %>%
ungroup() %>%
select(lc_lid, winter_wkday_mean) %>%
unique()
summer_weekend_mean <- summer_winter2013 %>%
filter(yearseason == "Summer 2013" & weekend) %>%
group_by(lc_lid) %>%
mutate(summer_wkend_mean = mean(kwh)) %>%
ungroup() %>%
select(lc_lid, summer_wkend_mean) %>%
unique()
summer_weekday_mean <- summer_winter2013 %>%
filter(yearseason == "Summer 2013" & !weekend) %>%
group_by(lc_lid) %>%
mutate(summer_wkday_mean = mean(kwh)) %>%
ungroup() %>%
select(lc_lid, summer_wkday_mean) %>%
unique()
summer_wkend_chg <- left_join(summer_weekend_mean, summer_weekday_mean) %>%
mutate(summer_wkend_pc_change = (100 * (summer_wkend_mean - summer_wkday_mean) / summer_wkday_mean))
Joining with `by = join_by(lc_lid)`
summer_wkend_chg %>%
ggplot() +
geom_histogram(aes(x = summer_wkend_pc_change), bins = 30)
summer_wkend_chg %>%
filter(summer_wkend_pc_change > 200)
1 household with very extreme value here, causes by change between two very small values
Infer change = 0 if comparator mean values both < 0.5 kWh
winter_wkend_chg <- left_join(winter_weekend_mean, winter_weekday_mean) %>%
mutate(winter_wkend_pc_change = if_else(
winter_wkend_mean < 0.5 & winter_wkday_mean < 0.5, 0,
(100 * (winter_wkend_mean - winter_wkday_mean) / winter_wkday_mean)))
Joining with `by = join_by(lc_lid)`
winter_wkend_chg %>%
filter(winter_wkend_mean < 0.5 & winter_wkday_mean < 0.5)
#filter(winter_wkend_pc_change == 0)
# All 21 zero values are inferred, minimal disruption out of 4,999 households
winter_wkend_chg %>%
ggplot() +
geom_histogram(aes(x = winter_wkend_pc_change), bins = 30)
colnames(summer_winter2013)
[1] "lc_lid" "date" "month"
[4] "season" "quarter" "yearseason"
[7] "wday" "weekend" "kwh"
[10] "cloud_cover" "sunshine" "global_radiation"
[13] "max_temp" "mean_temp" "min_temp"
[16] "precipitation_tf" "precipitation" "pressure"
[19] "snow_tf" "snow_depth" "temp_qual"
# build summary table, type of join doesn't matter
df <- inner_join(summer_mean, summer_variance) %>%
inner_join(winter_mean) %>%
inner_join(winter_variance) %>%
inner_join(summer_wkend_chg) %>%
inner_join(winter_wkend_chg) %>%
mutate(winter_pc_change = 100 * (winter_mean_kwh - summer_mean_kwh) / summer_mean_kwh,
seasonal_rel_chg_in_sd = winter_sd / summer_sd)
Joining with `by = join_by(lc_lid)`Joining with `by = join_by(lc_lid)`Joining with `by = join_by(lc_lid)`Joining with `by = join_by(lc_lid)`Joining with `by = join_by(lc_lid)`
v_low_hhs <- df %>%
filter(winter_mean_kwh < 0.5 & summer_mean_kwh < 0.5) %>% # 11 households with both < 0.5
pull(lc_lid)
# filter(winter_mean_kwh < 0.5) # 22 households with < 0.5 in winter
# filter(summer_mean_kwh < 0.5) # 22 households with < 0.5 in summer
df_clean <- df %>%
filter(!lc_lid %in% v_low_hhs)
df_trim <- df_clean %>%
select(-c(summer_mean_kwh, summer_wkend_mean, summer_wkday_mean, winter_wkend_mean, winter_wkday_mean, winter_sd, summer_sd))
df_trim # 4,988 households
# mutate:
## seasonal diff ()
Remaining feature engineering to do, from weather_data_exploration (heading: “work out predictors…”)
something to do with temp – note potential alias with winter/summer things
something to do with sunshine – note potential alias with winter/summer things
mean of top 10 highest kwh – helps reduce effect of any really big values
mean/median(July 2013) - as most recent (and most number households)
mean/median(January 2014) - as most recent (and most number households)
mean of bottom 10 non-zero kwh – filter out 0s first
label households as high/average/low as to how their median kwh falls in with sample median – e.g. use 0-10%, 10-20%,20-50%, 50-80%, 80-90%, 90-100% deciles
Unsupervised, try:
See clustering: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68 for more info.
# columns for clustering, removed aliased
colnames(df_trim)
[1] "lc_lid" "winter_mean_kwh" "summer_wkend_pc_change" "winter_wkend_pc_change"
[5] "winter_pc_change" "seasonal_rel_chg_in_sd"
Steps for K-means:
library(cluster)
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(dendextend)
---------------------
Welcome to dendextend version 1.17.1
Type citation('dendextend') for how to cite the package.
Type browseVignettes(package = 'dendextend') for the package vignette.
The github page is: https://github.com/talgalili/dendextend/
Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
You may ask questions at stackoverflow, use the r and dendextend tags:
https://stackoverflow.com/questions/tagged/dendextend
To suppress this message use: suppressPackageStartupMessages(library(dendextend))
---------------------
Attaching package: ‘dendextend’
The following object is masked from ‘package:stats’:
cutree
library(corrplot)
# prep data for clustering
df_cluster <- df_trim %>%
# name the rows with household id
column_to_rownames("lc_lid") %>%
# scale the data (normal around mean, sd 0,1)
mutate(across(where(is.numeric), scale))
df_cluster
# old df, now update, do no re-run this!
# look at correlations
corrplot(cor(df_cluster), method = "number", type = "lower")
Notes:
Two main correlations now are:
From previous df_cluster before further cleaning
colnames(df_clean) # with all the columns still
[1] "lc_lid" "summer_mean_kwh" "summer_sd" "winter_mean_kwh"
[5] "winter_sd" "summer_wkend_mean" "summer_wkday_mean" "summer_wkend_pc_change"
[9] "winter_wkend_mean" "winter_wkday_mean" "winter_wkend_pc_change" "winter_pc_change"
[13] "seasonal_rel_chg_in_sd"
df_clean %>%
filter(seasonal_rel_chg_in_sd < 100) %>%
ggplot() +
geom_point(aes(x = winter_pc_change, y = seasonal_rel_chg_in_sd))
Note there are some extreme values for both x and y, zooming in by filtering for lower seasonal sd change –> still see a positive correlation
Look at scaled values:
df_cluster %>%
filter(seasonal_rel_chg_in_sd < 1) %>%
ggplot() +
geom_point(aes(x = winter_pc_change, y = seasonal_rel_chg_in_sd))
df_clean %>%
ggplot() +
geom_point(aes(x = summer_wkend_pc_change, y = winter_wkend_pc_change))
df_clean %>%
filter(winter_pc_change <= 500) %>%
ggplot() +
geom_point(aes(x = winter_pc_change, y = winter_wkend_pc_change))
Cluster not centred on 0 for winter_pc_change, but is on winter_wkend change = 0 - so some households change for winter but not on weekday/weekend schedule
Some don’t change much for winter but change lots for weekday/weekend schedule
Some change lots for winter, but not on weekday/weekend schedule
So possible clusters (types of households) here.
clustered_hholds <- kmeans(df_cluster,
centers = 6,
iter.max = 25,
nstart = 25)
clustered_hholds
K-means clustering with 6 clusters of sizes 494, 244, 406, 1333, 2510, 1
Cluster means:
winter_mean_kwh summer_wkend_pc_change winter_wkend_pc_change winter_pc_change seasonal_rel_chg_in_sd
1 2.24391197 -0.2303654 -0.2882549 -0.01565763 -0.01330152
2 -0.26737778 2.3257502 2.4201417 -0.01623500 -0.02207425
3 -0.30996691 -1.5145509 -1.4452446 0.01057095 0.06590078
4 -0.09918387 0.6539218 0.6468119 -0.01624156 -0.02273681
5 -0.31267735 -0.2829311 -0.2885620 -0.01625076 -0.01858776
6 -0.37352219 -0.2955802 0.7430496 69.84380438 62.16479688
Clustering vector:
MAC000002 MAC000003 MAC000004 MAC000005 MAC000006 MAC000007 MAC000009 MAC000010 MAC000011 MAC000012
4 5 5 5 3 4 5 1 5 5
MAC000013 MAC000014 MAC000015 MAC000017 MAC000018 MAC000019 MAC000020 MAC000021 MAC000022 MAC000023
5 5 5 5 5 4 5 3 5 5
MAC000024 MAC000025 MAC000026 MAC000027 MAC000029 MAC000030 MAC000031 MAC000032 MAC000033 MAC000034
1 5 3 4 5 2 1 5 5 1
MAC000035 MAC000036 MAC000038 MAC000039 MAC000040 MAC000041 MAC000042 MAC000043 MAC000044 MAC000045
1 5 2 5 1 5 4 5 4 1
MAC000046 MAC000047 MAC000048 MAC000049 MAC000051 MAC000052 MAC000053 MAC000054 MAC000055 MAC000056
4 5 5 1 5 4 5 4 5 5
MAC000057 MAC000058 MAC000059 MAC000060 MAC000061 MAC000062 MAC000064 MAC000066 MAC000067 MAC000068
5 3 5 4 4 5 4 5 4 1
MAC000069 MAC000070 MAC000072 MAC000073 MAC000074 MAC000075 MAC000076 MAC000077 MAC000078 MAC000079
5 4 2 5 5 5 4 5 3 1
MAC000081 MAC000082 MAC000083 MAC000084 MAC000085 MAC000086 MAC000087 MAC000088 MAC000089 MAC000090
4 3 2 4 1 4 4 4 3 4
MAC000091 MAC000092 MAC000093 MAC000094 MAC000096 MAC000097 MAC000098 MAC000099 MAC000100 MAC000101
5 3 5 5 1 5 5 5 4 5
MAC000102 MAC000104 MAC000106 MAC000107 MAC000108 MAC000109 MAC000110 MAC000111 MAC000112 MAC000113
5 4 5 5 5 5 4 5 5 1
MAC000115 MAC000116 MAC000117 MAC000118 MAC000119 MAC000121 MAC000122 MAC000123 MAC000124 MAC000125
4 1 5 4 5 5 5 5 5 2
MAC000126 MAC000127 MAC000128 MAC000129 MAC000130 MAC000131 MAC000132 MAC000133 MAC000137 MAC000138
5 5 2 3 5 5 4 4 4 3
MAC000140 MAC000141 MAC000143 MAC000145 MAC000147 MAC000148 MAC000149 MAC000150 MAC000151 MAC000152
3 5 5 5 4 5 5 2 5 5
MAC000153 MAC000155 MAC000156 MAC000157 MAC000158 MAC000159 MAC000160 MAC000162 MAC000163 MAC000165
1 4 5 2 5 5 4 5 4 4
MAC000166 MAC000167 MAC000168 MAC000169 MAC000170 MAC000171 MAC000173 MAC000174 MAC000175 MAC000176
5 4 5 4 5 2 5 1 5 5
MAC000177 MAC000178 MAC000179 MAC000180 MAC000181 MAC000183 MAC000185 MAC000186 MAC000187 MAC000188
5 5 4 4 5 5 2 5 4 5
MAC000189 MAC000190 MAC000191 MAC000192 MAC000193 MAC000194 MAC000195 MAC000198 MAC000199 MAC000200
1 5 1 4 5 5 3 5 4 4
MAC000201 MAC000203 MAC000205 MAC000206 MAC000207 MAC000208 MAC000209 MAC000210 MAC000211 MAC000212
5 5 5 4 4 4 5 5 5 5
MAC000213 MAC000214 MAC000215 MAC000216 MAC000217 MAC000218 MAC000219 MAC000220 MAC000221 MAC000222
5 4 5 1 5 5 5 4 5 3
MAC000226 MAC000227 MAC000228 MAC000229 MAC000230 MAC000231 MAC000232 MAC000233 MAC000234 MAC000236
5 5 3 5 5 5 5 5 5 5
MAC000237 MAC000238 MAC000239 MAC000242 MAC000243 MAC000244 MAC000245 MAC000246 MAC000247 MAC000248
1 2 5 5 2 5 1 4 5 5
MAC000249 MAC000250 MAC000251 MAC000252 MAC000253 MAC000254 MAC000255 MAC000256 MAC000257 MAC000258
1 4 4 1 2 2 5 1 1 5
MAC000259 MAC000260 MAC000261 MAC000262 MAC000263 MAC000264 MAC000265 MAC000266 MAC000267 MAC000268
4 3 5 1 4 5 5 3 5 5
MAC000269 MAC000270 MAC000271 MAC000272 MAC000273 MAC000274 MAC000275 MAC000276 MAC000277 MAC000278
5 4 5 4 4 1 3 4 5 4
MAC000279 MAC000280 MAC000281 MAC000282 MAC000283 MAC000285 MAC000286 MAC000288 MAC000289 MAC000290
5 4 2 5 5 5 4 5 2 5
MAC000291 MAC000292 MAC000293 MAC000294 MAC000295 MAC000296 MAC000297 MAC000298 MAC000299 MAC000300
3 5 5 5 4 1 3 4 5 3
MAC000301 MAC000302 MAC000303 MAC000304 MAC000305 MAC000306 MAC000307 MAC000308 MAC000309 MAC000310
2 5 5 3 5 3 1 5 1 4
MAC000311 MAC000312 MAC000313 MAC000314 MAC000315 MAC000316 MAC000317 MAC000318 MAC000319 MAC000320
4 1 5 5 5 2 5 5 4 2
MAC000321 MAC000322 MAC000323 MAC000324 MAC000326 MAC000327 MAC000328 MAC000329 MAC000330 MAC000331
1 5 5 3 3 4 5 5 5 1
MAC000332 MAC000333 MAC000334 MAC000337 MAC000339 MAC000340 MAC000341 MAC000342 MAC000343 MAC000344
5 5 5 5 4 5 4 5 5 5
MAC000345 MAC000346 MAC000347 MAC000348 MAC000349 MAC000350 MAC000351 MAC000352 MAC000353 MAC000354
3 5 4 5 4 5 5 4 5 1
MAC000355 MAC000357 MAC000359 MAC000360 MAC000361 MAC000362 MAC000363 MAC000364 MAC000365 MAC000366
5 4 2 4 2 4 5 4 5 5
MAC000367 MAC000368 MAC000369 MAC000370 MAC000371 MAC000372 MAC000373 MAC000374 MAC000375 MAC000376
5 4 5 5 5 3 4 5 5 5
MAC000377 MAC000378 MAC000380 MAC000381 MAC000382 MAC000383 MAC000384 MAC000385 MAC000386 MAC000387
5 5 5 4 4 5 5 3 5 5
MAC000388 MAC000389 MAC000390 MAC000391 MAC000392 MAC000394 MAC000395 MAC000396 MAC000397 MAC000398
5 3 5 4 5 5 5 3 5 5
MAC000400 MAC000401 MAC000402 MAC000403 MAC000404 MAC000405 MAC000406 MAC000407 MAC000408 MAC000409
3 5 5 5 3 5 5 5 5 1
MAC000411 MAC000412 MAC000413 MAC000414 MAC000415 MAC000416 MAC000417 MAC000418 MAC000419 MAC000420
5 5 3 5 1 5 5 4 5 1
MAC000421 MAC000422 MAC000423 MAC000424 MAC000425 MAC000426 MAC000427 MAC000428 MAC000429 MAC000430
5 5 5 5 4 4 4 5 5 2
MAC000431 MAC000433 MAC000434 MAC000435 MAC000436 MAC000437 MAC000439 MAC000440 MAC000441 MAC000442
1 5 5 5 5 1 5 4 5 5
MAC000443 MAC000444 MAC000445 MAC000446 MAC000447 MAC000448 MAC000449 MAC000451 MAC000452 MAC000453
1 5 3 4 5 5 4 1 1 5
MAC000454 MAC000455 MAC000457 MAC000459 MAC000461 MAC000462 MAC000464 MAC000465 MAC000466 MAC000468
4 1 5 5 4 5 5 5 4 4
MAC000470 MAC000471 MAC000472 MAC000473 MAC000474 MAC000475 MAC000476 MAC000477 MAC000478 MAC000479
5 5 4 5 5 1 2 4 5 4
MAC000480 MAC000481 MAC000482 MAC000483 MAC000484 MAC000485 MAC000487 MAC000488 MAC000489 MAC000490
1 5 4 5 2 5 5 5 5 4
MAC000492 MAC000493 MAC000496 MAC000497 MAC000498 MAC000499 MAC000500 MAC000501 MAC000502 MAC000503
4 2 5 4 4 5 5 3 4 4
MAC000504 MAC000505 MAC000506 MAC000507 MAC000508 MAC000509 MAC000510 MAC000511 MAC000513 MAC000514
4 4 5 5 5 2 5 5 3 5
MAC000515 MAC000516 MAC000518 MAC000519 MAC000520 MAC000521 MAC000522 MAC000523 MAC000524 MAC000525
5 3 5 4 3 5 1 5 1 5
MAC000526 MAC000527 MAC000529 MAC000530 MAC000531 MAC000532 MAC000533 MAC000535 MAC000536 MAC000538
5 4 5 1 4 4 5 3 5 5
MAC000539 MAC000541 MAC000542 MAC000543 MAC000544 MAC000547 MAC000548 MAC000549 MAC000550 MAC000551
3 3 5 5 4 1 1 5 1 5
MAC000552 MAC000553 MAC000554 MAC000555 MAC000556 MAC000557 MAC000558 MAC000559 MAC000560 MAC000561
2 4 5 1 5 1 5 4 4 5
MAC000562 MAC000563 MAC000564 MAC000565 MAC000566 MAC000568 MAC000569 MAC000570 MAC000571 MAC000572
5 5 5 5 4 5 1 5 5 4
MAC000573 MAC000574 MAC000575 MAC000576 MAC000577 MAC000578 MAC000579 MAC000580 MAC000581 MAC000582
4 4 5 5 4 5 4 4 5 4
MAC000584 MAC000585 MAC000586 MAC000587 MAC000588 MAC000589 MAC000590 MAC000591 MAC000592 MAC000593
3 4 3 5 4 5 5 5 2 5
MAC000594 MAC000595 MAC000596 MAC000597 MAC000598 MAC000599 MAC000600 MAC000601 MAC000603 MAC000604
4 4 1 5 5 4 5 5 5 5
MAC000605 MAC000608 MAC000609 MAC000610 MAC000611 MAC000612 MAC000613 MAC000614 MAC000615 MAC000616
3 5 5 5 5 4 5 2 5 5
MAC000617 MAC000618 MAC000620 MAC000621 MAC000622 MAC000623 MAC000624 MAC000625 MAC000626 MAC000627
5 3 4 1 5 4 4 2 3 4
MAC000628 MAC000629 MAC000630 MAC000631 MAC000632 MAC000633 MAC000634 MAC000635 MAC000637 MAC000638
1 5 4 5 5 4 5 4 2 5
MAC000639 MAC000640 MAC000641 MAC000642 MAC000643 MAC000644 MAC000645 MAC000646 MAC000647 MAC000648
4 3 1 5 2 5 5 4 5 4
MAC000649 MAC000650 MAC000651 MAC000652 MAC000653 MAC000654 MAC000655 MAC000656 MAC000657 MAC000658
5 4 5 5 5 1 5 3 5 5
MAC000659 MAC000660 MAC000661 MAC000662 MAC000663 MAC000664 MAC000665 MAC000667 MAC000669 MAC000670
4 5 1 4 5 1 5 5 5 4
MAC000671 MAC000672 MAC000673 MAC000674 MAC000675 MAC000676 MAC000677 MAC000678 MAC000679 MAC000680
4 5 5 5 2 4 5 5 5 5
MAC000681 MAC000682 MAC000683 MAC000684 MAC000685 MAC000686 MAC000687 MAC000688 MAC000689 MAC000690
5 3 5 5 3 5 5 5 1 5
MAC000691 MAC000692 MAC000693 MAC000694 MAC000695 MAC000696 MAC000697 MAC000698 MAC000699 MAC000701
5 5 1 3 5 5 1 5 5 5
MAC000702 MAC000703 MAC000704 MAC000705 MAC000706 MAC000707 MAC000708 MAC000709 MAC000710 MAC000711
5 5 4 5 2 3 5 5 5 5
MAC000712 MAC000713 MAC000714 MAC000715 MAC000716 MAC000717 MAC000718 MAC000719 MAC000720 MAC000721
5 5 5 1 5 2 5 5 5 5
MAC000722 MAC000723 MAC000724 MAC000725 MAC000726 MAC000727 MAC000728 MAC000730 MAC000731 MAC000732
4 2 5 4 5 5 5 4 2 4
MAC000734 MAC000736 MAC000737 MAC000739 MAC000740 MAC000742 MAC000743 MAC000744 MAC000745 MAC000746
5 4 1 5 4 5 5 5 5 3
MAC000747 MAC000748 MAC000749 MAC000750 MAC000751 MAC000752 MAC000753 MAC000754 MAC000755 MAC000756
5 3 5 4 5 5 3 5 3 5
MAC000757 MAC000759 MAC000761 MAC000762 MAC000763 MAC000764 MAC000765 MAC000766 MAC000767 MAC000768
2 5 5 5 5 5 5 4 3 4
MAC000769 MAC000771 MAC000772 MAC000773 MAC000775 MAC000776 MAC000777 MAC000778 MAC000779 MAC000780
4 5 5 4 5 5 3 1 2 1
MAC000781 MAC000782 MAC000783 MAC000784 MAC000785 MAC000786 MAC000787 MAC000789 MAC000790 MAC000791
3 5 5 4 4 4 4 5 5 4
MAC000792 MAC000793 MAC000794 MAC000795 MAC000796 MAC000798 MAC000799 MAC000800 MAC000801 MAC000802
4 5 4 4 5 5 5 4 4 5
MAC000803 MAC000804 MAC000805 MAC000806 MAC000807 MAC000808 MAC000809 MAC000810 MAC000811 MAC000812
3 2 5 1 4 1 5 5 1 5
MAC000814 MAC000815 MAC000816 MAC000817 MAC000819 MAC000820 MAC000821 MAC000822 MAC000823 MAC000824
5 5 4 1 3 5 3 2 4 5
MAC000825 MAC000826 MAC000827 MAC000828 MAC000829 MAC000830 MAC000831 MAC000832 MAC000833 MAC000834
4 3 4 1 3 4 5 4 5 5
MAC000835 MAC000836 MAC000837 MAC000838 MAC000839 MAC000840 MAC000841 MAC000842 MAC000843 MAC000844
3 5 5 5 5 3 5 5 4 4
MAC000845 MAC000846 MAC000847 MAC000848 MAC000849 MAC000850 MAC000851 MAC000852 MAC000853 MAC000854
4 2 5 1 2 5 4 5 5 5
MAC000855 MAC000856 MAC000857 MAC000858 MAC000859 MAC000860 MAC000861 MAC000862 MAC000864 MAC000866
1 4 5 4 5 5 3 4 5 5
MAC000867 MAC000868 MAC000869 MAC000870 MAC000871 MAC000872 MAC000873 MAC000874 MAC000875 MAC000876
4 5 4 5 5 5 5 5 5 5
MAC000877 MAC000878 MAC000880 MAC000882 MAC000884 MAC000885 MAC000886 MAC000887 MAC000888 MAC000889
4 4 5 4 4 1 5 5 4 1
MAC000890 MAC000891 MAC000892 MAC000893 MAC000894 MAC000895 MAC000896 MAC000897 MAC000899 MAC000900
4 5 1 5 2 2 5 4 1 5
MAC000901 MAC000902 MAC000903 MAC000904 MAC000905 MAC000906 MAC000907 MAC000908 MAC000909 MAC000910
5 2 5 5 3 2 2 1 5 5
MAC000911 MAC000913 MAC000914 MAC000915 MAC000916 MAC000917 MAC000918 MAC000919 MAC000920 MAC000921
5 5 5 3 1 4 3 5 5 5
MAC000922 MAC000923 MAC000924 MAC000926 MAC000927 MAC000928 MAC000929 MAC000930 MAC000931 MAC000933
2 5 4 5 5 2 4 4 4 5
MAC000934 MAC000935 MAC000936 MAC000937 MAC000938 MAC000939 MAC000940 MAC000941 MAC000942 MAC000943
4 4 5 5 5 3 5 4 4 5
MAC000944 MAC000945 MAC000947 MAC000948 MAC000949 MAC000950 MAC000953 MAC000954 MAC000955 MAC000956
4 2 5 5 5 5 5 4 5 3
MAC000957 MAC000958 MAC000959 MAC000961 MAC000962 MAC000963 MAC000965 MAC000966 MAC000967 MAC000968
1 5 4 5 5 4 4 1 4 2
MAC000969 MAC000970 MAC000971 MAC000972 MAC000973 MAC000974 MAC000975 MAC000976 MAC000979 MAC000980
5 5 2 5 5 5 5 4 5 2
MAC000981 MAC000982 MAC000984 MAC000985 MAC000986 MAC000987 MAC000989 MAC000990 MAC000991 MAC000992
5 3 5 1 1 5 5 5 5 5
MAC000993 MAC000994 MAC000995 MAC000996 MAC000997 MAC000999 MAC001000 MAC001001 MAC001002 MAC001003
4 3 5 4 4 5 4 5 5 5
MAC001004 MAC001006 MAC001007 MAC001008 MAC001009 MAC001010 MAC001012 MAC001013 MAC001014 MAC001015
1 5 5 3 1 5 5 5 1 5
MAC001016 MAC001017 MAC001018 MAC001019 MAC001021 MAC001022 MAC001023 MAC001024 MAC001025 MAC001026
4 5 5 5 5 5 5 5 5 5
MAC001027 MAC001029 MAC001030 MAC001031 MAC001032 MAC001033 MAC001034 MAC001035 MAC001036 MAC001037
5 4 3 5 1 5 5 5 1 3
MAC001038 MAC001039 MAC001040 MAC001041 MAC001042 MAC001043 MAC001044 MAC001047 MAC001048 MAC001049
5 4 5 5 5 5 1 5 4 5
MAC001050 MAC001052 MAC001053 MAC001054 MAC001055 MAC001056 MAC001057 MAC001058 MAC001059 MAC001060
5 4 5 4 5 5 5 5 5 5
MAC001061 MAC001062 MAC001063 MAC001066 MAC001067 MAC001068 MAC001069 MAC001070 MAC001071 MAC001072
5 4 5 5 5 5 3 2 5 5
MAC001073 MAC001075 MAC001076 MAC001077 MAC001078 MAC001079 MAC001081 MAC001083 MAC001084 MAC001085
4 5 4 5 1 5 4 5 5 1
MAC001086 MAC001087 MAC001088 MAC001089 MAC001090 MAC001091 MAC001092 MAC001093 MAC001094 MAC001095
4 4 5 5 5 5 3 5 5 1
MAC001096 MAC001097 MAC001098 MAC001099 MAC001100 MAC001102 MAC001103 MAC001104 MAC001105 MAC001106
5 5 1 1 5 4 5 5 5 5
MAC001107 MAC001108 MAC001109 MAC001110 MAC001111 MAC001112 MAC001113 MAC001114 MAC001117 MAC001118
3 5 5 4 4 5 2 5 5 5
MAC001119 MAC001120 MAC001121 MAC001123 MAC001124 MAC001125 MAC001126 MAC001127 MAC001128 MAC001129
5 5 4 5 4 4 5 4 5 5
[ reached getOption("max.print") -- omitted 3988 entries ]
Within cluster sum of squares by cluster:
[1] 1364.9832 912.9073 2010.2263 1365.9295 1597.2958 0.0000
(between_SS / total_SS = 70.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
# tidy tells us stats: here size of the clusters and mean (check?) values for our variables
broom::tidy(clustered_hholds,
col.names = colnames(df_cluster)) # don't actually need col.names
# glance gives us sum-squared (ss) and tots withins values
broom::glance(clustered_hholds, col.names = colnames(df_cluster))
# augment tells which rownames assigned to which cluster
broom::augment(clustered_hholds, col.names = colnames(df_cluster), data = df_cluster)
6 clusters are interesting (remember these are scaled values, so -ve doesn’t mean -ve, just means less than the mean (centred at 0)):
Clustering does seem to make some sense here!
library(broom)
max_k <- 10
k_clusters <- tibble(k = 1:max_k) %>%
mutate(kclust = map(k, ~ kmeans(df_cluster, .x, iter.max = 15, nstart = 25)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, df_cluster))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `kclust = map(k, ~kmeans(df_cluster, .x, iter.max = 15, nstart = 25))`.
Caused by warning:
! Quick-TRANSfer stage steps exceeded maximum (= 249400)
k_clusters
clustering <- k_clusters %>%
unnest(glanced)
clustering
ggplot(clustering, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(1, 20, by = 1))
Elbow at k = 4
Note the goal here is to find the minimum tot.withinss but with the fewest clusters possible - it’s a cost function to find the best gain (here, loss!) in tot.withinss at the least cost in adding k.
# from factoextra, requires also package "ggsignif", "rstatix"
library(ggsignif)
library(rstatix)
Attaching package: ‘rstatix’
The following object is masked from ‘package:stats’:
filter
fviz_nbclust(df_cluster,
kmeans,
method = "silhouette",
nstart = 25)
Warning: did not converge in 10 iterations
this suggests optimal k is 3
Note: silhouette method gives a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation) – https://towardsdatascience.com/silhouette-method-better-than-elbow-method-to-find-optimal-clusters-378d62ff6891
fviz_nbclust(df_cluster,
kmeans,
method = "gap_stat",
nstart = 25,
k.max = 15)
# use verbose = FALSE to suppress progress messages
Note lots of warnings that clusters did not converge in 10 iterations, unable to find argument to ask function to do more iterations.
This takes a long time - gap stat not so good for large dataset?
Check for k = 3, k = 4 – very different answers for elbow and silhouette methods so look at the data and decide
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 4) %>%
ggplot(aes(x = summer_wkend_pc_change, y = winter_wkend_pc_change)) +
geom_point(aes(colour = .cluster, fill = .cluster), size = 1, alpha = 0.2) +
scale_colour_manual(values = c("mediumblue", "goldenrod1", "deeppink1", "lightgreen")) +
scale_fill_manual(values = c("mediumblue", "goldenrod1", "deeppink1", "lightgreen"))
Looking at wkend change in both seasons - clusters 2 and 4 separate this in half, with cluster 2 being higher changers, cluster 4 being lower changers; can see some cluster 1 (especially overlapping with cluster 4) but can’t see cluster 3 here
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 4) %>%
#filter(seasonal_rel_chg_in_sd < 1) %>% # zoom in to bottom-left group
ggplot(aes(x = winter_pc_change, y = seasonal_rel_chg_in_sd)) +
geom_point(aes(colour = .cluster, fill = .cluster), size = 1, alpha = 0.2) +
scale_colour_manual(values = c("mediumblue", "goldenrod1", "deeppink1", "lightgreen")) +
scale_fill_manual(values = c("mediumblue", "goldenrod1", "deeppink1", "lightgreen"))
Here, cluster 2 and 4 are at lower end – lower winter_pc_change and low/verage seasonal_rel_change_in_sd
Cluster 3 is one extreme point with very high winter_pc_change and seasonal_rel_chg_in_sd
k4_means <- k_clusters %>%
unnest(tidied) %>%
filter(k==4) %>%
select(-c(kclust, glanced, augmented, k)) %>%
relocate(cluster, .before = winter_mean_kwh) %>%
relocate(size, .after = cluster)
k4_means
Overall, with 4 clusters, we have:
Can we exclude the household in cluster 3 as unusual? Help us to see the other clusters better?
First check k = 3:
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
ggplot(aes(x = summer_wkend_pc_change, y = winter_wkend_pc_change)) +
geom_point(aes(colour = .cluster, fill = .cluster), size = 1, alpha = 0.2) +
scale_colour_manual(values = c("mediumblue", "goldenrod1", "deeppink1")) +
scale_fill_manual(values = c("mediumblue", "goldenrod1", "deeppink1"))
Similar to above, clusters 2 and 3 are lower changers, cluster 1 are higher changers
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
filter(seasonal_rel_chg_in_sd < 1) %>% # zoom in to bottom-left group
ggplot(aes(x = winter_pc_change, y = seasonal_rel_chg_in_sd)) +
geom_point(aes(colour = .cluster, fill = .cluster), size = 1, alpha = 0.2) +
scale_colour_manual(values = c("mediumblue", "goldenrod1", "deeppink1")) +
scale_fill_manual(values = c("mediumblue", "goldenrod1", "deeppink1"))
Extreme value is in cluster 2 (high changers) - see unzoomed version
Some separation pink v yellow (clusters 2 v 3) along this axis
k3_means <- k_clusters %>%
unnest(tidied) %>%
filter(k==3) %>%
select(-c(kclust, glanced, augmented, k)) %>%
relocate(cluster, .before = winter_mean_kwh) %>%
relocate(size, .after = cluster)
k3_means
Overall, with 3 clusters, we have:
Same type of clusters in both – I think worthwhile to remove the outliers by checking the maths on my summary stats, and redo to see if clustering becomes more optimal
view(skimr::skim(df_clean))
winter_pc_change is wild - large sd, large mean, large outlier max value
Calculated as
100 * (winter_mean_kwh - summer_mean_kwh) / summer_mean_kwh
Need to reframe to catch 0.000001 / 0.01 which is effectively 0 / 0 but will come out as 1000s
–> do this in new notebook: “K means clustering”
outlier
[1] "MAC004735"
# remove outlier from input daa, before scaling!
df_trim2 <- df_trim %>%
filter(lc_lid != outlier)
df_trim2
4,987 households now
# prep data for clustering
df_cluster2 <- df_trim2 %>%
# name the rows with household id
column_to_rownames("lc_lid") %>%
# scale the data (normal around mean, sd 0,1)
mutate(across(where(is.numeric), scale))
df_cluster2
# old df, now update, do no re-run this!
# look at correlations
corrplot(cor(df_cluster2), method = "number", type = "lower")
Still same/similar correlations as before
max_k <- 10
k_clusters <- tibble(k = 1:max_k) %>%
mutate(kclust = map(k, ~ kmeans(df_cluster2, .x, iter.max = 15, nstart = 25)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, df_cluster2))
k_clusters
clustering <- k_clusters %>%
unnest(glanced)
clustering
ggplot(clustering, aes(x = k, y = tot.withinss)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(1, 20, by = 1))
Very weird behaviour here… k = 3 or 5 may be optimal (low tot.withinss) and NOT k=4
Note the goal here is to find the minimum tot.withinss but with the fewest clusters possible - it’s a cost function to find the best gain (here, loss!) in tot.withinss at the least cost in adding k.
fviz_nbclust(df_cluster2,
kmeans,
method = "silhouette",
nstart = 25)
Warning: Quick-TRANSfer stage steps exceeded maximum (= 249350)Warning: did not converge in 10 iterationsWarning: did not converge in 10 iterationsWarning: did not converge in 10 iterations
this suggests optimal k is 3 or 4.
Note: silhouette method gives a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation) – https://towardsdatascience.com/silhouette-method-better-than-elbow-method-to-find-optimal-clusters-378d62ff6891
Taken together, both elbow and silhouette suggest looking at k = 3 clusters would be optimal
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
ggplot(aes(x = summer_wkend_pc_change, y = winter_wkend_pc_change)) +
geom_point(aes(colour = .cluster, fill = .cluster), size = 1, alpha = 0.2) +
scale_colour_manual(values = c("mediumblue", "springgreen2", "deeppink1")) +
scale_fill_manual(values = c("mediumblue", "springgreen2", "deeppink1"))
Similar pattern to above, here cluster 1 is low changers, cluster 3 is high changers
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
filter(seasonal_rel_chg_in_sd < 1) %>% # zoom in to bottom-left group
ggplot(aes(x = winter_pc_change, y = seasonal_rel_chg_in_sd)) +
geom_point(aes(colour = .cluster, fill = .cluster), size = 1, alpha = 0.2) +
scale_colour_manual(values = c("grey80", "goldenrod1", "deeppink1")) +
scale_fill_manual(values = c("grey80", "goldenrod1", "deeppink1"))
Still have extreme values here…
And not able to discern what cluster 2 is along these plotted axes
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
#filter(winter_pc_change < 1) %>%
ggplot(aes(x = winter_pc_change, y = winter_wkend_pc_change)) +
geom_point(aes(colour = .cluster, fill = .cluster), size = 1, alpha = 0.2) +
scale_colour_manual(values = c("mediumblue", "springgreen2", "deeppink1")) +
scale_fill_manual(values = c("mediumblue", "springgreen2", "deeppink1"))
One extreme value, cluster 1, out at ~70 for winter_pc_change
Cluster 1 v 3 is wkend change difference (higher/lower than average in the scaled values)
Not clear what cluster 2 is doing here either
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
filter(.cluster == 2) %>%
ggplot(aes(x = summer_wkend_pc_change, y = winter_wkend_pc_change)) +
geom_point(size = 1, alpha = 0.2)
Similar to cluster 3
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
filter(.cluster == 2) %>%
filter(winter_pc_change < 1) %>%
ggplot(aes(x = winter_pc_change, y = seasonal_rel_chg_in_sd)) +
geom_point(size = 1, alpha = 0.2)
All cluster 2 are at no seasonal change
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
filter(.cluster == 2) %>%
filter(winter_pc_change < 1) %>%
ggplot(aes(x = winter_pc_change, y = winter_wkend_pc_change)) +
geom_point(size = 1, alpha = 0.2)
clustering %>%
unnest(cols = c(augmented)) %>%
filter(k <= 3) %>%
#filter(.cluster == 2) %>%
filter(winter_pc_change < 1) %>%
ggplot(aes(y = winter_pc_change, x = winter_mean_kwh)) +
geom_point(aes(colour = .cluster, fill = .cluster), size = 1, alpha = 0.2) +
scale_colour_manual(values = c("mediumblue", "springgreen2", "deeppink1")) +
scale_fill_manual(values = c("mediumblue", "springgreen2", "deeppink1"))
Cluster 2 has the extreme winter_pc_change value, rest are all ~-0.015 winter change and distributed about winter mean value (0-2) - so hidden in the main mass
df_clean_clustered <- left_join(df_clean, cluster_labels_scaled, by = join_by(lc_lid == .rownames))
df_clean_clustered
joined_all_clustered <- left_join(joined_all, cluster_labels_scaled, by = join_by(lc_lid == .rownames))
joined_all_clustered
library(tsibble)
joined_all_clustered %>%
ggplot() +
geom_line(aes(x = date, y = kwh, group = lc_lid), alpha = 0.01) +
facet_wrap(~ .cluster)
look at kmeans for these 3 clusters (from above) decide where to also remove / deal with extreme values (e.g. what if comparing to 0, have an upper limit e.g. just above the normal max value)
explain
try DB-SCAN method too
pick clustering method, reassign cluster number back to unscaled dataset
explain clusters
show timeseries per cluster
extension: consider ts forecasting (probabilistic!!) for each cluster extension: consider clustering by relationship with weather (e.g. coefficient for linear regression model for each household)